SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

[arXiv] [Project Page] [Dataset]

Jianhong Bai¹*, Menghan Xia²†, Xintao Wang², Ziyang Yuan³, Xiao Fu⁴,
Zuozhu Liu¹, Haoji Hu¹, Pengfei Wan², Di Zhang²
(*Work done during an internship at KwaiVGI, Kuaishou Technology; †Corresponding author)

¹Zhejiang University, ²Kuaishou Technology, ³Tsinghua University, ⁴CUHK.

ICLR 2025

Important Note: This open-source repository is intended to provide a reference implementation. Due to the difference in the underlying T2V model's performance, the open-source version may not achieve the same performance as the model in our paper.

🔥 Updates

[2025.04.15]: Please feel free to explore our subsequent work, ReCamMaster.
[2025.04.15]: Released a new version of the SynCamVideo Dataset.
[2025.04.15]: Released the training and inference code and the model checkpoint.
[2024.12.10]: Released the project page and the SynCamVideo Dataset.

📖 Introduction

TL;DR: We propose SynCamMaster, an efficient method to lift pre-trained text-to-video models for open-domain multi-camera video generation from diverse viewpoints. We also release a multi-camera synchronized video dataset rendered with Unreal Engine 5.

teaser_video_compressed.mp4

⚙️ Code: SynCamMaster + Wan2.1 (Inference & Training)

The model utilized in our paper is an internally developed T2V model, not Wan2.1. Due to company policy restrictions, we are unable to open-source the model used in the paper. Consequently, we migrated SynCamMaster to Wan2.1 to validate the effectiveness of our method. Due to differences in the underlying T2V model, you may not achieve the same results as demonstrated in the demo.

Inference

Step 1: Set up the environment

DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Install DiffSynth-Studio:

git clone https://github.com/KwaiVGI/SynCamMaster.git
cd SynCamMaster
pip install -e .

Step 2: Download the pretrained checkpoints

Download the pre-trained Wan2.1 models

cd SynCamMaster
python download_wan2.1.py

Download the pre-trained SynCamMaster checkpoint

Please download it from Hugging Face and place it in models/SynCamMaster/checkpoints.

Step 3: Test the example videos

python inference_syncammaster.py --cam_type "az"

We provide several preset camera types. Additionally, you can generate new camera poses for testing.

Training

Step 1: Set up the environment

pip install lightning pandas websockets

Step 2: Prepare the training dataset

Download the SynCamVideo dataset.
Extract VAE features

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_syncammaster.py \
  --task data_process \
  --dataset_path path/to/the/SynCamVideo/Dataset \
  --output_path ./models \
  --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
  --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
  --tiled \
  --num_frames 81 \
  --height 480 \
  --width 832 \
  --dataloader_num_workers 2
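The --num_frames, --height, and --width values determine the size of the cached latents. The sketch below assumes the Wan2.1 VAE compresses time by a factor of 4 (keeping the first frame on its own) and each spatial axis by a factor of 8. These strides are an assumption taken from the public Wan2.1 configuration, not from this README, and `wan_latent_shape` is a hypothetical helper rather than a function in this repository.

```python
def wan_latent_shape(num_frames, height, width, t_stride=4, s_stride=8):
    """Return the (frames, height, width) latent grid for a video clip.

    t_stride and s_stride are ASSUMED compression factors of the VAE.
    """
    if (num_frames - 1) % t_stride != 0:
        raise ValueError("num_frames must have the form t_stride * n + 1")
    if height % s_stride != 0 or width % s_stride != 0:
        raise ValueError("height and width must be multiples of s_stride")
    latent_frames = (num_frames - 1) // t_stride + 1
    return latent_frames, height // s_stride, width // s_stride


# Settings used in the feature-extraction command above.
print(wan_latent_shape(81, 480, 832))  # (21, 60, 104)
```

Under these assumptions, a frame count such as 80 would be rejected, which is one reason to keep num_frames at 81 when changing other settings.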

Generate Captions for Each Video

You can use video caption tools like LLaVA to generate captions for each video and store them in the metadata.csv file.
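As a rough sketch of the last part of this step, the snippet below writes captions to a metadata.csv file using Python's standard csv module. The column names `file_name` and `text`, and the use of relative video paths as keys, are assumptions made for illustration; check how train_syncammaster.py reads metadata.csv before relying on them. `write_metadata` is a hypothetical helper.

```python
import csv


def write_metadata(captions, csv_path):
    """Write one row per video: its path and its caption.

    captions maps a video path (str) to a caption (str).
    The header names below are ASSUMED, not confirmed by the repository.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "text"])  # assumed column names
        for file_name, text in sorted(captions.items()):
            # csv.writer quotes captions containing commas or quotes.
            writer.writerow([file_name, text])
```

A call might look like `write_metadata({"scene1/videos/cam01.mp4": "A person waves in a city street."}, "metadata.csv")`, where the caption string would come from whichever captioning tool you use.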

Calculate the available sample list

python generate_sample_list.py

Step 3: Training

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_syncammaster.py \
  --task train \
  --output_path ./models/train \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 8000 \
  --max_epochs 100 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --dataloader_num_workers 4

We did not search for the optimal hyperparameters and trained with a batch size of 1 on each GPU. You may achieve better performance by tuning hyperparameters such as the learning rate and by increasing the batch size.
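To make the batch-size remark concrete, here is a small arithmetic sketch. It assumes data-parallel training in which each of the 8 visible GPUs processes its own batch, so that the number of samples per optimizer step is the product of the three quantities below. `effective_batch_size` is a hypothetical helper, not part of the training script.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulate_grad_batches):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_gpu_batch * num_gpus * accumulate_grad_batches


# Settings from the training command above: batch size 1 on each of 8 GPUs.
print(effective_batch_size(1, 8, 1))  # 8

# Raising --accumulate_grad_batches to 4 grows the effective batch
# without increasing per-GPU memory use.
print(effective_batch_size(1, 8, 4))  # 32
```

Under this assumption, gradient accumulation is the cheapest way to try a larger batch when GPU memory is the limiting factor. The learning rate may need retuning alongside it.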

Step 4: Test the model

python inference_syncammaster.py --cam_type "az" --ckpt_path path/to/the/checkpoint

📷 Dataset: SynCamVideo Dataset

1. Dataset Introduction

TL;DR: The SynCamVideo Dataset is a multi-camera synchronized video dataset rendered using Unreal Engine 5. It includes synchronized multi-camera videos and their corresponding camera poses. The SynCamVideo Dataset can be valuable in fields such as camera-controlled video generation, synchronized video production, and 3D/4D reconstruction. The camera is stationary in the SynCamVideo Dataset. If you require footage with moving cameras rather than stationary ones, please explore our MultiCamVideo Dataset.

syncamvideo.mp4

The dataset consists of 3.4K different dynamic scenes, each captured by 10 cameras, for a total of 34K videos. Each dynamic scene is composed of four elements: {3D environment, character, animation, camera}. Specifically, we use the animation to drive the character and position the animated character within the 3D environment. Time-synchronized cameras are then set up to render the multi-camera video data.

3D Environment: We collect 37 high-quality 3D environment assets from Fab. To minimize the domain gap between rendered data and real-world videos, we primarily select visually realistic 3D scenes, while choosing a few stylized or surreal 3D scenes as a supplement. To ensure data diversity, the selected scenes cover a variety of indoor and outdoor settings, such as city streets, shopping malls, cafes, office rooms, and the countryside.

Character: We collect 66 different human 3D models as characters from Fab and Mixamo.

Animation: We collect 93 different animations from Fab and Mixamo, including common actions such as waving, dancing, and cheering. We use these animations to drive the collected characters and create diverse datasets through various combinations.

Camera: To enhance the diversity of the dataset, each camera is randomly sampled on a hemispherical surface centered around the character.
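The camera placement described above can be illustrated with a short sketch. It samples positions uniformly on the upper hemisphere of a sphere centered at the character and computes the unit direction from each camera toward the center. The uniform distribution, the radius of 3.0, and the z-up axis convention are all assumptions for illustration. The README does not state how the actual dataset cameras were sampled, and both function names are hypothetical.

```python
import math
import random


def sample_hemisphere_point(radius, rng):
    """Sample a point uniformly on the upper hemisphere (z >= 0).

    The sphere is centered at the origin, taken here as the character.
    Drawing the height uniformly yields a uniform distribution over area.
    """
    z = rng.uniform(0.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r_xy = math.sqrt(1.0 - z * z)
    return (radius * r_xy * math.cos(phi),
            radius * r_xy * math.sin(phi),
            radius * z)


def view_direction(position, target=(0.0, 0.0, 0.0)):
    """Unit vector pointing from the camera position toward the target."""
    d = [t - p for p, t in zip(position, target)]
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
cameras = [sample_hemisphere_point(3.0, rng) for _ in range(10)]
directions = [view_direction(p) for p in cameras]
```

The sketch stops at a viewing direction on purpose. Turning it into a full extrinsic matrix requires choosing an up vector and an axis convention, which should follow whatever camera_extrinsics.json uses.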

2. Statistics and Configurations

Dataset Statistics:

Number of Dynamic Scenes	Cameras per Scene	Total Videos
3,400	10	34,000

Video Configurations:

Resolution	Frame Number	FPS
1280x1280	81	15

Note: You can use 'center crop' to adjust the video's aspect ratio to fit your video generation model, such as 16:9, 9:16, 4:3, or 3:4.
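As a minimal sketch of the center crop mentioned in the note, the function below computes the largest centered crop window with a given aspect ratio. It only returns the crop rectangle. Applying it to frames is left to whichever video library you use, and `center_crop_box` is a hypothetical helper name.

```python
def center_crop_box(width, height, aspect_w, aspect_h):
    """Return (left, top, crop_w, crop_h) of the largest centered crop
    whose aspect ratio is aspect_w:aspect_h."""
    if width * aspect_h > height * aspect_w:
        # Source is wider than the target ratio: keep the full height.
        crop_h = height
        crop_w = height * aspect_w // aspect_h
    else:
        # Source is taller than (or equal to) the target: keep the full width.
        crop_w = width
        crop_h = width * aspect_h // aspect_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, crop_w, crop_h


# 1280x1280 source frames cropped to two of the ratios listed above.
print(center_crop_box(1280, 1280, 16, 9))  # (0, 280, 1280, 720)
print(center_crop_box(1280, 1280, 3, 4))   # (160, 0, 960, 1280)
```

Because the crop is centered, the optical axis stays in the middle of the cropped frame, but the field of view along the cropped axis becomes narrower.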

Camera Configurations:

Focal Length	Aperture	Sensor Height	Sensor Width
24mm	5.0	23.76mm	23.76mm
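The camera configuration above is enough to derive pinhole intrinsics in pixel units. The sketch below assumes an ideal pinhole camera with no lens distortion and a principal point at the image center. Neither assumption is stated in this README, and the function names are hypothetical.

```python
import math


def pinhole_intrinsics(focal_mm, sensor_w_mm, sensor_h_mm, width_px, height_px):
    """Return (fx, fy, cx, cy) in pixels for an ideal pinhole camera."""
    fx = focal_mm * width_px / sensor_w_mm
    fy = focal_mm * height_px / sensor_h_mm
    # ASSUMPTION: the principal point sits at the image center.
    cx, cy = width_px / 2.0, height_px / 2.0
    return fx, fy, cx, cy


def field_of_view_deg(focal_mm, sensor_mm):
    """Field of view along one sensor axis, in degrees."""
    return math.degrees(2.0 * math.atan(sensor_mm / (2.0 * focal_mm)))


# Values from the table above, for the native 1280x1280 frames.
fx, fy, cx, cy = pinhole_intrinsics(24.0, 23.76, 23.76, 1280, 1280)
fov = field_of_view_deg(24.0, 23.76)
```

With these values, fx and fy come out at roughly 1293 pixels and the field of view at about 52.7 degrees on both axes. If the videos are resized for training, fx, fy, cx, and cy scale by the same factor as the image.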

3. File Structure

SynCamVideo-Dataset
├── train
│   └── f24_aperture5
│       ├── scene1    # one dynamic scene
│       │   ├── videos
│       │   │   ├── cam01.mp4    # synchronized 81-frame videos at 1280x1280 resolution
│       │   │   ├── cam02.mp4
│       │   │   ├── ...
│       │   │   └── cam10.mp4
│       │   └── cameras
│       │       └── camera_extrinsics.json    # 81-frame camera extrinsics of the 10 cameras 
│       ├── ...
│       └── scene3400
└── val
    └── basic
        ├── videos
        │   ├── cam01.mp4    # example videos corresponding to the validation cameras
        │   ├── cam02.mp4
        │   ├── ...
        │   └── cam10.mp4
        └── cameras
            └── camera_extrinsics.json    # 10 cameras for validation
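After extraction, a quick integrity check against the layout above can be sketched as follows. The directory and file names are taken from the tree, while the function names are hypothetical. The sketch deliberately does not parse camera_extrinsics.json, since its schema is not described here.

```python
from pathlib import Path


def list_scenes(dataset_root, split_dir="train/f24_aperture5"):
    """Yield (scene_name, video_paths, extrinsics_path) for each scene folder."""
    base = Path(dataset_root) / split_dir
    for scene in sorted(p for p in base.iterdir() if p.is_dir()):
        videos = sorted((scene / "videos").glob("cam*.mp4"))
        extrinsics = scene / "cameras" / "camera_extrinsics.json"
        yield scene.name, videos, extrinsics


def find_incomplete_scenes(dataset_root, expected_cameras=10):
    """Return names of scenes missing videos or the extrinsics file."""
    incomplete = []
    for name, videos, extrinsics in list_scenes(dataset_root):
        if len(videos) != expected_cameras or not extrinsics.is_file():
            incomplete.append(name)
    return incomplete
```

Calling `find_incomplete_scenes("SynCamVideo-Dataset")` on a complete download should return an empty list. Note that scene names are sorted as strings, so scene10 comes before scene2.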

4. Useful scripts

Data Extraction

tar -xzvf SynCamVideo-Dataset.tar.gz

Camera Visualization

python vis_cam.py

The visualization script is modified from CameraCtrl, thanks to their inspiring work.

🤗 Awesome Related Works

Feel free to explore these outstanding related works, including but not limited to:

GCD: synthesize large-angle novel viewpoints of 4D dynamic scenes from a monocular video.

CVD: multi-view video generation with multiple camera trajectories.

SV4D: multi-view consistent dynamic 3D content generation.

Additionally, check out our "MasterFamily" projects:

ReCamMaster: re-capture in-the-wild videos with novel camera trajectories.

3DTrajMaster: control multiple entity motions in 3D space (6DoF) for text-to-video generation.

StyleMaster: enable artistic video generation and translation with reference style image.

Acknowledgments

We thank Jinwen Cao, Yisong Guo, Haowen Ji, Jichao Wang, and Yi Wang from Kuaishou Technology for their invaluable help in constructing the SynCamVideo-Dataset. We thank Guanjun Wu and Jiangnan Ye for their help with running 4DGS.

🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

@article{bai2024syncammaster,
  title={SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints},
  author={Bai, Jianhong and Xia, Menghan and Wang, Xintao and Yuan, Ziyang and Fu, Xiao and Liu, Zuozhu and Hu, Haoji and Wan, Pengfei and Zhang, Di},
  journal={arXiv preprint arXiv:2412.07760},
  year={2024}
}
